import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
plt.rcParams['figure.figsize'] = (10, 5)
kakamana
January 21, 2023
This module introduces the fundamental idea behind XGBoost: boosted learners. After gaining an understanding of how XGBoost works, we'll apply it to one of the most common classification problems in the industry: predicting when customers will stop being customers (churn).
This post on Classification with XGBoost is part of the DataCamp course Extreme Gradient Boosting with XGBoost.
It is part of my data science learning experience through DataCamp.
It’s time to create our first XGBoost model! We can use the scikit-learn .fit() / .predict() paradigm that we are already familiar with to build our XGBoost models, as the xgboost library has a scikit-learn compatible API!
Here, we’ll be working with churn data. This dataset contains imaginary data from a ride-sharing app: user behaviors over their first month of app usage in a set of imaginary cities, as well as whether they were still using the service 5 months after sign-up.
Our goal is to use the first month’s worth of data to predict whether the app’s users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, we’ll split the data into training and test sets, fit a small XGBoost model on the training set, and evaluate its performance on the test set by computing its accuracy.
|   | avg_dist | avg_rating_by_driver | avg_rating_of_driver | avg_inc_price | inc_pct | weekday_pct | fancy_car_user | city_Carthag | city_Harko | phone_iPhone | first_month_cat_more_1_trip | first_month_cat_no_trips | month_5_still_here |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.67 | 5.0 | 4.7 | 1.10 | 15.4 | 46.2 | True | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 8.26 | 5.0 | 5.0 | 1.00 | 0.0 | 50.0 | False | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0.77 | 5.0 | 4.3 | 1.00 | 0.0 | 100.0 | False | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 2.36 | 4.9 | 4.6 | 1.14 | 20.0 | 80.0 | True | 0 | 1 | 1 | 1 | 0 | 1 |
| 4 | 3.13 | 4.9 | 4.4 | 1.19 | 11.8 | 82.4 | False | 0 | 0 | 0 | 1 | 0 | 0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 avg_dist 50000 non-null float64
1 avg_rating_by_driver 49799 non-null float64
2 avg_rating_of_driver 41878 non-null float64
3 avg_inc_price 50000 non-null float64
4 inc_pct 50000 non-null float64
5 weekday_pct 50000 non-null float64
6 fancy_car_user 50000 non-null bool
7 city_Carthag 50000 non-null int64
8 city_Harko 50000 non-null int64
9 phone_iPhone 50000 non-null int64
10 first_month_cat_more_1_trip 50000 non-null int64
11 first_month_cat_no_trips 50000 non-null int64
12 month_5_still_here 50000 non-null int64
dtypes: bool(1), float64(6), int64(6)
memory usage: 4.6 MB
from sklearn.model_selection import train_test_split
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)
# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
print("\nOur model has an accuracy of around 76%. Later we'll learn about ways to fine-tune our XGBoost models")
accuracy: 0.758200
Our model has an accuracy of around 76%. Later we'll learn about ways to fine-tune our XGBoost models
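For comparison, scikit-learn's accuracy_score computes the same quantity as the manual np.sum comparison above; a small sketch assuming the preds and y_test from the cell above:
# Same accuracy, computed with scikit-learn's helper instead of np.sum
from sklearn.metrics import accuracy_score
print("accuracy via accuracy_score: %f" % accuracy_score(y_test, preds))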
Your task in this exercise is to make a simple decision tree using scikit-learn’s DecisionTreeClassifier on the breast cancer dataset.
This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).
We’ve preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You’ll specify a parameter called max_depth. Many other parameters can be modified within this model; you can check all of them out in the scikit-learn documentation.
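X and y are preloaded in the course environment; if you are following along locally, a minimal sketch (an assumption, not part of the original exercise) is to use scikit-learn's bundled copy of the breast cancer dataset:
# Assumption: scikit-learn's built-in breast cancer data stands in for the preloaded X and y
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)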
from sklearn.tree import DecisionTreeClassifier
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)
# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)
# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)
# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4 == y_test)) / y_test.shape[0]
print("Accuracy:", accuracy)
Accuracy: 0.9736842105263158
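To see the effect of max_depth, you can inspect the fitted tree directly; a small sketch assuming dt_clf_4 from above and that X came from scikit-learn's breast cancer data (otherwise substitute your own feature names):
from sklearn.datasets import load_breast_cancer
from sklearn.tree import export_text
# The learned tree should be no deeper than the max_depth we set (4)
print(dt_clf_4.get_depth())
# Text dump of the learned decision rules
print(export_text(dt_clf_4, feature_names=list(load_breast_cancer().feature_names)))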
We’ll now practice using XGBoost’s learning API through its baked-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets, called a DMatrix.
In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgb.cv() function, you have to first explicitly convert your data into a DMatrix. So, that’s what you will do here before running cross-validation on churn_data.
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]
# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
params = {'objective':"reg:logistic", "max_depth":3}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
nfold=3, num_boost_round=5,
metrics="error", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Print the accuracy
print(((1 - cv_results['test-error-mean']).iloc[-1]))
print("\ncv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1 - error. The final cross-validated accuracy of around 75% is in line with what we saw on the single train/test split earlier.")
train-error-mean train-error-std test-error-mean test-error-std
0 0.28232 0.002366 0.28378 0.001932
1 0.26951 0.001855 0.27190 0.001932
2 0.25605 0.003213 0.25798 0.003963
3 0.25090 0.001844 0.25434 0.003827
4 0.24654 0.001981 0.24852 0.000934
0.751480015401492
cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1 - error. The final cross-validated accuracy of around 75% is in line with what we saw on the single train/test split earlier.
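Because cv_results is a plain DataFrame indexed by boosting round, it is straightforward to visualise how the error evolves; a minimal sketch assuming the cv_results from the cell above:
# Plot mean train and test error per boosting round
plt.plot(cv_results['train-error-mean'], label='train error')
plt.plot(cv_results['test-error-mean'], label='test error')
plt.xlabel('Boosting round')
plt.ylabel('Error')
plt.legend()
plt.show()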
Now that you’ve used cross-validation to compute average out-of-sample accuracy (after converting from an error), it’s very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics, as sketched at the end of this post) as an argument to the metrics parameter of xgb.cv().
Your job in this exercise is to compute another common metric used in binary classification: the area under the curve ("auc").
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
nfold=3, num_boost_round=5,
metrics="auc", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])
print("\nAn AUC of around 0.83 is quite strong. As you have seen, XGBoost's learning API makes it very easy to compute any metric you may be interested in. Later, you'll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it's time to learn a little about exactly when to use XGBoost.")
train-auc-mean train-auc-std test-auc-mean test-auc-std
0 0.768893 0.001544 0.767863 0.002819
1 0.790864 0.006758 0.789156 0.006846
2 0.815872 0.003900 0.814476 0.005997
3 0.822959 0.002018 0.821682 0.003912
4 0.827528 0.000769 0.826191 0.001937
0.8261911413597645
An AUC of around 0.83 is quite strong. As you have seen, XGBoost's learning API makes it very easy to compute any metric you may be interested in. Later, you'll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it's time to learn a little about exactly when to use XGBoost.
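As mentioned above, the metrics argument also accepts a list, so a single xgb.cv() call can track several metrics per boosting round. A minimal sketch, reusing churn_dmatrix and params from earlier (the cv_results_multi name is just for illustration):
# Cross-validate with both classification error and AUC in one call
cv_results_multi = xgb.cv(dtrain=churn_dmatrix, params=params,
                          nfold=3, num_boost_round=5,
                          metrics=["error", "auc"], as_pandas=True, seed=123)
print(cv_results_multi.columns.tolist())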